First, we import all the needed librairies.


In [1]:
# Import libraries
import requests
from bs4 import BeautifulSoup
import json
import math
import time
import pandas as pd
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns

1. Top-ranking universities

We noticed that the actual data from topuniversities is not directly on the webpage, but on a separate text file, in JSON format. Thus, we first get this JSON, parse it, and take the first 200 entries in it. We noticed that the univertsity with rank 199 is actually the 198th entry, and thus the last 3 universities needs to have their rank corrected.


In [2]:
r = requests.get('https://www.topuniversities.com/sites/default/files/qs-rankings-data/357051.txt?_=1508259845358')
raw_data = json.loads(r.text)['data'][:200]

We can print the first entry of the data to see how the informations are represented.


In [3]:
raw_data[0]


Out[3]:
{'cc': 'US',
 'core_id': '410',
 'country': 'United States',
 'guide': '<a href="/where-to-study/north-america/united-states/guide" class="guide-link" target="_blank">United States</a>',
 'logo': '<img src="https://www.topuniversities.com/sites/default/files/massachusetts-institute-of-technology-mit_410_small_0.jpg" alt="Massachusetts Institute of Technology (MIT)  Logo">',
 'nid': '294850',
 'rank_display': '1',
 'region': 'North America',
 'score': '100',
 'stars': '6',
 'title': 'Massachusetts Institute of Technology (MIT)',
 'url': '/universities/massachusetts-institute-technology-mit'}

We can now define functions that will help us during the processing of this JSON.

First, process_university takes as input the raw JSON of a particular university, and outputs a dictionnary containing the name, rank, country, region, number of faculty members (international and total) and number of students (international and total) for that given unviversity.

It uses other functions defined below.


In [4]:
def process_university(uni):
    name = uni['title']
    rank = get_rank(uni['rank_display'])
    country = uni['country']
    region = uni['region']
    
    numbers = get_numbers(uni['url'])
    info = {'name' : name, 'rank' : rank, 'country' : country, 'region' : region}
    info.update(numbers)
    return info

As there can be ties in rank, the displayed rank is not always a integer. Furthermore, as said above, the last 3 universities have incorrect ranks and need to be fixed.


In [5]:
def get_rank(rank_display):
    rank = int(rank_display.replace("=", ""))
    if rank >= 199:
        rank -= 1
    return rank

To get the number of faculty members (international and total) and number of students (international and total), we need to get another request, and this time, we will need to parse the webpage using BeautifulSoup.

By inspecting the webpage, we noticed the classes of the elements where the numbers are contained. Once we get these elements, we further need to parse its content, to get the value as an integer.

During the parsing, we noticed that one university (NYU) did not have the same template as the others, and so its number of students is unknown.


In [6]:
def get_numbers(url):
    r = requests.get("https://www.topuniversities.com/" + url)
    soup = BeautifulSoup(r.text, 'html.parser')
    
    faculty_info = soup.select(".text .number")
    if len(faculty_info) >= 2:
        total_faculty = parse_int(faculty_info[0].decode_contents(formatter="html"))
        international_faculty = parse_int(faculty_info[1].decode_contents(formatter="html"))
    else:
        total_faculty = math.nan
        international_faculty = math.nan
    
    student_info = soup.select(".barp .number")
    if len(faculty_info) >= 2:
        total_student = parse_int(student_info[0].decode_contents(formatter="html"))
        international_student = parse_int(student_info[1].decode_contents(formatter="html"))
    else:
        total_student = math.nan
        international_student = math.nan
    return {'total_faculty' : total_faculty, 'international_faculty' : international_faculty, 'total_student' : total_student, 'international_student' : international_student}

In [7]:
def parse_int(str):
    return int(str.replace("\n", "").replace(" ", "").replace(",", ""))

We put the gathered and parsed data of the universities in a new JSON file for later reuse so that we don't have to generate new requests every time we run the notebook. Then we create a dataframe from this data and display it.


In [8]:
# Uncomment and run this if you want to regenerate the JSON
"""unis1 = []
for uni in raw_data:
    unis1.append(process_university(uni))
    
with open('data1.json', 'w') as f:
    json.dump(unis1, f)"""


Out[8]:
"unis1 = []\nfor uni in raw_data:\n    unis1.append(process_university(uni))\n    \nwith open('data1.json', 'w') as f:\n    json.dump(unis1, f)"

In [9]:
with open('data1.json', 'r') as f:
    unis1 = json.load(f)

In [10]:
df = pd.DataFrame(unis1)
df


Out[10]:
country international_faculty international_student name rank region total_faculty total_student
0 United States 1679.0 3717.0 Massachusetts Institute of Technology (MIT) 1 North America 2982.0 11067.0
1 United States 2042.0 3611.0 Stanford University 2 North America 4285.0 15878.0
2 United States 1311.0 5266.0 Harvard University 3 North America 4350.0 22429.0
3 United States 350.0 647.0 California Institute of Technology (Caltech) 4 North America 953.0 2255.0
4 United Kingdom 2278.0 6699.0 University of Cambridge 5 Europe 5490.0 18770.0
5 United Kingdom 2964.0 7353.0 University of Oxford 6 Europe 6750.0 19720.0
6 United Kingdom 2554.0 14854.0 UCL (University College London) 7 Europe 6345.0 31080.0
7 United Kingdom 2071.0 8746.0 Imperial College London 8 Europe 3930.0 16090.0
8 United States 635.0 3379.0 University of Chicago 9 North America 2449.0 13557.0
9 Switzerland 1886.0 7563.0 ETH Zurich - Swiss Federal Institute of Techno... 10 Europe 2477.0 19815.0
10 Singapore 2993.0 7251.0 Nanyang Technological University, Singapore (NTU) 11 Asia 4338.0 25738.0
11 Switzerland 1300.0 5896.0 Ecole Polytechnique Fédérale de Lausanne (EPFL) 12 Europe 1695.0 10343.0
12 United States 246.0 1793.0 Princeton University 13 North America 1007.0 8069.0
13 United States 970.0 5411.0 Cornell University 14 North America 2718.0 21904.0
14 Singapore 3086.0 8917.0 National University of Singapore (NUS) 15 Asia 5106.0 32728.0
15 United States 1708.0 2469.0 Yale University 16 North America 4940.0 12402.0
16 United States 1061.0 4105.0 Johns Hopkins University 17 North America 4462.0 16146.0
17 United States 913.0 8105.0 Columbia University 18 North America 6189.0 25045.0
18 United States 1383.0 4250.0 University of Pennsylvania 19 North America 5499.0 20639.0
19 Australia 927.0 5551.0 The Australian National University 20 Oceania 1600.0 14442.0
20 United States 1920.0 7527.0 University of Michigan 21 North America 6809.0 43147.0
21 United States 226.0 2864.0 Duke University 21 North America 2938.0 15320.0
22 United Kingdom 1553.0 10551.0 The University of Edinburgh 23 Europe 4075.0 28040.0
23 United Kingdom 1513.0 9451.0 King's College London 23 Europe 3650.0 23780.0
24 China 932.0 4072.0 Tsinghua University 25 Asia 5506.0 36300.0
25 Hong Kong 2085.0 8230.0 The University of Hong Kong 26 Asia 3012.0 20214.0
26 United States 1395.0 6086.0 University of California, Berkeley (UCB) 27 North America 3321.0 36703.0
27 Japan 254.0 2639.0 The University of Tokyo 28 Asia 4514.0 27279.0
28 United States 1141.0 3422.0 Northwestern University 28 North America 4231.0 18675.0
29 Hong Kong 835.0 2921.0 The Hong Kong University of Science and Techno... 30 Asia 1150.0 10375.0
... ... ... ... ... ... ... ... ...
170 Germany 413.0 3897.0 Albert-Ludwigs-Universitaet Freiburg 171 Europe 1966.0 23214.0
171 India 4.0 80.0 Indian Institute of Technology Delhi (IITD) 172 Asia 466.0 7477.0
172 Saudi Arabia 665.0 989.0 King Fahd University of Petroleum & Minerals 173 Asia 1062.0 6040.0
173 United Kingdom 717.0 7502.0 University of Liverpool 173 Europe 2390.0 22065.0
174 United States 487.0 1899.0 University of Virginia 173 North America 2509.0 21560.0
175 Australia 546.0 7769.0 University of Technology Sydney 176 Oceania 1328.0 24207.0
176 France 137.0 1132.0 CentraleSupélec 177 Europe 529.0 3877.0
177 United States 504.0 4597.0 University of Florida 178 North America 5419.0 45485.0
178 India 19.0 93.0 Indian Institute of Technology Bombay (IITB) 179 Asia 876.0 9402.0
179 Netherlands 299.0 2235.0 University of Twente 179 Europe 910.0 9332.0
180 Germany 641.0 3736.0 University of Göttingen 181 Europe 3678.0 30402.0
181 Belgium 515.0 1853.0 Vrije Universiteit Brussel (VUB) 182 Europe 1792.0 9284.0
182 Brazil 109.0 966.0 Universidade Estadual de Campinas (Unicamp) 182 Latin America 1968.0 26572.0
183 United States 566.0 2679.0 University of Colorado Boulder 182 North America 3685.0 30219.0
184 Austria 172.0 5199.0 Vienna University of Technology 182 Europe 706.0 17479.0
185 United States 488.0 2805.0 University of Rochester 186 North America 2569.0 9636.0
186 Spain 181.0 2976.0 Universidad Autónoma de Madrid 187 Europe 2738.0 26511.0
187 Italy 153.0 4195.0 Alma Mater Studiorum - University of Bologna 188 Europe 2990.0 63399.0
188 United Kingdom 422.0 4077.0 University of Reading 188 Europe 1280.0 13330.0
189 India 423.0 47.0 Indian Institute of Science (IISc) Bangalore 190 Asia 423.0 4071.0
190 South Africa 379.0 3325.0 University of Cape Town 191 Africa 1733.0 19593.0
191 Italy 1.0 48.0 Scuola Normale Superiore di Pisa 192 Europe 86.0 532.0
192 Italy 1.0 135.0 Scuola Superiore Sant'Anna Pisa di Studi Unive... 192 Europe 119.0 789.0
193 Japan 261.0 2178.0 Keio University CEMS MIM 192 Asia 3905.0 33500.0
194 Sweden 646.0 2636.0 Stockholm University 195 Europe 2154.0 28281.0
195 Spain 230.0 3848.0 Universitat Autònoma de Barcelona 195 Europe 2187.0 31986.0
196 United States 206.0 4900.0 Texas A&M University 195 North America 3446.0 60294.0
197 Mexico 821.0 1412.0 Instituto Tecnológico y de Estudios Superiores... 198 Latin America 1822.0 13376.0
198 Netherlands 502.0 8234.0 Maastricht University 199 Europe 1277.0 16385.0
199 Chile 127.0 2134.0 Universidad de Chile 200 Latin America 2256.0 38848.0

200 rows × 8 columns

Rank according to the ratio between faculty members and students


In [11]:
df['staff_student_ratio'] = df['total_faculty'] / df['total_student']
df.sort_values(['staff_student_ratio'], ascending=[False])[['name', 'rank', 'staff_student_ratio']]


Out[11]:
name rank staff_student_ratio
3 California Institute of Technology (Caltech) 4 0.422616
15 Yale University 16 0.398323
5 University of Oxford 6 0.342292
4 University of Cambridge 5 0.292488
16 Johns Hopkins University 17 0.276353
1 Stanford University 2 0.269870
0 Massachusetts Institute of Technology (MIT) 1 0.269450
185 University of Rochester 186 0.266604
18 University of Pennsylvania 19 0.266437
17 Columbia University 18 0.247115
7 Imperial College London 8 0.244251
116 Technical University of Denmark 116 0.238455
72 University of Copenhagen 73 0.232977
28 Northwestern University 28 0.226560
94 Lomonosov Moscow State University 95 0.221910
70 Pohang University of Science And Technology (P... 71 0.213025
58 Ecole Polytechnique 59 0.207614
6 UCL (University College London) 7 0.204151
99 Washington University in St. Louis 100 0.200312
147 Emory University 147 0.199191
2 Harvard University 3 0.193945
181 Vrije Universiteit Brussel (VUB) 182 0.193020
21 Duke University 21 0.191775
73 University of Zurich 73 0.191713
75 Tohoku University 76 0.191339
156 École Normale Supérieure de Lyon 157 0.189109
141 University of Pittsburgh 142 0.188762
8 University of Chicago 9 0.180645
123 Wageningen University 124 0.179362
35 Kyoto University 36 0.176722
... ... ... ...
161 National Tsing Hua University 161 0.073434
49 The University of Sydney 50 0.071983
140 RWTH Aachen University 141 0.071344
32 University of California, Los Angeles (UCLA) 33 0.071272
125 Ghent University 125 0.070062
106 KIT, Karlsruhe Institute of Technology 107 0.070051
81 The University of Auckland 82 0.069060
195 Universitat Autònoma de Barcelona 195 0.068374
150 University of Otago 151 0.068314
149 Michigan State University 149 0.064159
133 University of California, Santa Barbara (UCSB) 134 0.062594
171 Indian Institute of Technology Delhi (IITD) 172 0.062324
68 University of Illinois at Urbana-Champaign 69 0.060198
164 University of California, Irvine 164 0.059905
119 Humboldt-Universität zu Berlin 120 0.059220
66 University of Texas at Austin 67 0.059027
199 Universidad de Chile 200 0.058072
196 Texas A&M University 195 0.057153
59 Monash University 60 0.055787
71 KU Leuven 71 0.055229
175 University of Technology Sydney 176 0.054860
69 Georgia Institute of Technology 70 0.053743
152 Université catholique de Louvain (UCL) 153 0.050212
151 University of Waterloo 152 0.047523
187 Alma Mater Studiorum - University of Bologna 188 0.047162
60 University of Washington 61 0.043600
124 Freie Universitaet Berlin 125 0.041778
169 Politecnico di Milano 170 0.041349
184 Vienna University of Technology 182 0.040391
51 New York University (NYU) 52 NaN

200 rows × 3 columns

For clarity, we only show the universities with the 10 highest ratio.


In [12]:
df.sort_values('staff_student_ratio', ascending=False)[['name', 'staff_student_ratio']].head(10).plot(title='Rank according to the ratio between faculty members and students',
                                                                                                      figsize=(13,6),kind='bar', x = 'name')


Out[12]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fce3c5d92e8>

We note that most of the universities with a high ratio are also highly ranked.

Rank according to the ratio of international students


In [13]:
df['international_student_ratio'] = df['international_student'] / df['total_student']
df.sort_values(['international_student_ratio'], ascending=[False])[['name', 'rank', 'international_student_ratio']]


Out[13]:
name rank international_student_ratio
34 London School of Economics and Political Scien... 35 0.691393
11 Ecole Polytechnique Fédérale de Lausanne (EPFL) 12 0.570047
7 Imperial College London 8 0.543567
198 Maastricht University 199 0.502533
47 Carnegie Mellon University 47 0.478062
6 UCL (University College London) 7 0.477928
91 University of St Andrews 92 0.457955
41 The University of Melbourne 41 0.427434
126 Queen Mary University of London 127 0.421816
25 The University of Hong Kong 26 0.407144
23 King's College London 23 0.397435
112 University of Groningen 113 0.391303
56 The University of Warwick 57 0.387963
97 University of Geneva 98 0.385256
19 The Australian National University 20 0.384365
110 The University of Adelaide 109 0.382015
9 ETH Zurich - Swiss Federal Institute of Techno... 10 0.381681
22 The University of Edinburgh 23 0.376284
5 University of Oxford 6 0.372870
33 The University of Manchester 34 0.368530
49 The University of Sydney 50 0.364840
134 Lancaster University 135 0.362927
44 The University of New South Wales (UNSW Sydney) 45 0.359240
59 Monash University 60 0.358296
4 University of Cambridge 5 0.356899
48 City University of Hong Kong 49 0.354221
157 University of Aberdeen 158 0.353686
82 The University of Sheffield 82 0.343885
64 University of Glasgow 65 0.342179
173 University of Liverpool 173 0.339995
... ... ... ...
114 Nanjing University 114 0.091091
191 Scuola Normale Superiore di Pisa 192 0.090226
75 Tohoku University 76 0.089976
183 University of Colorado Boulder 182 0.088653
174 University of Virginia 173 0.088080
108 Utrecht University 109 0.087265
35 Kyoto University 36 0.086620
161 National Tsing Hua University 161 0.085457
36 Seoul National University 36 0.083060
196 Texas A&M University 195 0.081268
122 Hokkaido University 122 0.076743
79 University of North Carolina, Chapel Hill 80 0.076223
61 Shanghai Jiao Tong University 62 0.073412
163 University of Bergen 164 0.071877
187 Alma Mater Studiorum - University of Bologna 188 0.066168
193 Keio University CEMS MIM 192 0.065015
144 The Hebrew University of Jerusalem 145 0.062549
40 KAIST - Korea Advanced Institute of Science & ... 41 0.059434
102 University of Helsinki 102 0.055043
199 Universidad de Chile 200 0.054932
70 Pohang University of Science And Technology (P... 71 0.040423
137 Pontificia Universidad Católica de Chile (UC) 137 0.036700
182 Universidade Estadual de Campinas (Unicamp) 182 0.036354
96 University of Science and Technology of China 97 0.034182
120 Universidade de São Paulo 121 0.031745
121 Universidad Nacional Autónoma de México (UNAM) 122 0.015309
189 Indian Institute of Science (IISc) Bangalore 190 0.011545
171 Indian Institute of Technology Delhi (IITD) 172 0.010699
178 Indian Institute of Technology Bombay (IITB) 179 0.009892
51 New York University (NYU) 52 NaN

200 rows × 3 columns

For clarity, we only show the universities with the 10 highest ratio.


In [14]:
df.sort_values('international_student_ratio', ascending=False)[['name', 'international_student_ratio']].head(10).plot(title='Rank according to the ratio of international students',figsize=(13,6), kind='bar', x = 'name')


Out[14]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fce3c565390>

We can note that again, a lot of higly ranked universities also have a high ratio, but in this case, it's not as much apparent as in the previous ratio.

Rank according to the ratio between faculty members and students, grouped by country


In [15]:
df_staff_country = df.groupby('country').mean().sort_values('staff_student_ratio', ascending=False)[['staff_student_ratio']]
df_staff_country


Out[15]:
staff_student_ratio
country
Russia 0.221910
Denmark 0.186580
Saudi Arabia 0.175828
Singapore 0.162279
Japan 0.155840
Malaysia 0.153893
United States 0.151151
South Korea 0.149356
France 0.144006
Israel 0.136047
United Kingdom 0.135913
Switzerland 0.134872
Argentina 0.134267
Hong Kong 0.128988
Norway 0.122510
Mexico 0.121204
China 0.116318
Finland 0.110731
Netherlands 0.105311
Sweden 0.104255
Canada 0.103298
Italy 0.100247
Germany 0.098542
Belgium 0.092131
South Africa 0.088450
India 0.086467
Spain 0.085209
Ireland 0.082180
Taiwan 0.080311
Brazil 0.079505
Australia 0.076407
Chile 0.070883
New Zealand 0.068687
Austria 0.057298

In [16]:
df_staff_country['staff_student_ratio'].plot(title='Rank according to the ratio between faculty members and students, grouped by country', figsize=(16,6),kind='bar')


Out[16]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fce3c423c88>

Here we can note some surprising results. A lot of countries that have high ratios do not seem to have a lot of universities that are highly ranked. In fact we believe that for these countries, only a few or even a single university is in the ranking. Given that the university would be the best university in that country, it is understandable that their ratio is high.

Rank according to the ratio between faculty members and students, grouped by region


In [17]:
df_staff_region = df.groupby('region').mean().sort_values('staff_student_ratio', ascending=False)[['staff_student_ratio']]
df_staff_region


Out[17]:
staff_student_ratio
region
North America 0.144831
Asia 0.134673
Europe 0.120003
Latin America 0.096779
Africa 0.088450
Oceania 0.075003

In [18]:
df_staff_region['staff_student_ratio'].plot(title = 'Rank according to the ratio between faculty members and students, grouped by region',figsize=(13,6),kind='bar')


Out[18]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fce3be970f0>

Here we note that the highest ratios seem to be in regions that have the most of highly ranked universities.

Rank according to the ratio of international students, grouped by country


In [19]:
df_int_country = df.groupby('country').mean().sort_values('international_student_ratio', ascending=False)[['international_student_ratio']]
df_int_country


Out[19]:
international_student_ratio
country
United Kingdom 0.351308
Australia 0.346878
Switzerland 0.313816
Hong Kong 0.312148
Austria 0.306095
Singapore 0.277091
Canada 0.252604
New Zealand 0.248971
Netherlands 0.245456
Ireland 0.241791
Argentina 0.221658
France 0.213286
United States 0.194802
Malaysia 0.194168
South Africa 0.169703
Russia 0.168624
Sweden 0.167998
Germany 0.165300
Denmark 0.164809
Saudi Arabia 0.163742
Belgium 0.160108
Spain 0.121967
Italy 0.120418
Norway 0.111379
China 0.107560
Finland 0.102890
South Korea 0.102453
Taiwan 0.091680
Japan 0.090000
Israel 0.062549
Mexico 0.060436
Chile 0.045816
Brazil 0.034050
India 0.010712

In [20]:
df_int_country['international_student_ratio'].plot(title='Rank according to the ratio of international students, grouped by country', figsize=(16,6),kind='bar')


Out[20]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fce3be662b0>

We note that the countries that seem to have the highest ratios of international students also seem to be the most attractive in terms of their location and/or wealth.

Rank according to the ratio of international students, grouped by region


In [21]:
df_int_region = df.groupby('region').mean().sort_values('international_student_ratio', ascending=False)[['international_student_ratio']]
df_int_region


Out[21]:
international_student_ratio
region
Oceania 0.329077
Europe 0.245932
North America 0.202437
Africa 0.169703
Asia 0.132394
Latin America 0.071751

In [22]:
df_int_region['international_student_ratio'].plot(title='Rank according to the ratio of international students, grouped by region', figsize=(13,6),kind='bar')


Out[22]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fce3c3911d0>

The results by region seem to confirm the results by country. Again, attractive locations and rich countries seem to attract the most international students. Latin America and Asia however are an exception.

For the next part, where we look at the rankings of universities according to Times Higher Education, we need to define a mapping of countries and regions, since the second rankings do not contain region data and we would like to have it in our tables.


In [23]:
country_region = dict(df[['country', 'region']].groupby(['country', 'region']).groups.keys())

2. With Times Higher Education

Similarly to the previous part, we notice that all the relevant data can be obtained from a simple text file in JSON format. Moreover we see that the ratios we are interested in are already present in this file and need not be calculated.


In [24]:
r2 = requests.get('https://www.timeshighereducation.com/sites/default/files/the_data_rankings/world_university_rankings_2018_limit0_369a9045a203e176392b9fb8f8c1cb2a.json')
raw_data2 = json.loads(r2.text)['data'][:200]

When loading the university data from the webpage, we have to adapt it so that it matches the format from the other ranking. One problem we have is that in Times Higher Education, Russia is referred to as the Russian Federation, so we change it back to Russia. Moreover in the dictionary we created for the countries and regions, we do not have a region value for Luxemburg since it doesn't appear in the first ranking. So we manually put the region to Europe.


In [25]:
unis2 = []
for uni in raw_data2:
    name = uni['name']
    rank = uni['rank'].replace('=', '')
    country = uni['location']
    if country == 'Russian Federation':
        country = 'Russia'
    int_students = uni['stats_pc_intl_students'].replace('%', '')
    staff_student = uni['stats_student_staff_ratio']
    
    info = {'name' : name, 'rank': rank, 'country': country, 'region' : country_region.get(country, 'Europe'),
            'international_student_ratio' : int(int_students) / 100.0, 'staff_student_ratio': 1 / float(staff_student)}
    unis2.append(info)

As before, we put our data in a new JSON file to avoid reloading the file each time we run the notebook and generating new requests.


In [26]:
# Uncomment and run this if you want to regenerate the JSON
"""with open('data2.json', 'w') as f:
    json.dump(unis2, f)"""


Out[26]:
"with open('data2.json', 'w') as f:\n    json.dump(unis2, f)"

In [27]:
with open('data2.json', 'r') as f:
    all_unis2 = json.load(f)

df2 = pd.DataFrame(all_unis2)
df2


Out[27]:
country international_student_ratio name rank region staff_student_ratio
0 United Kingdom 0.38 University of Oxford 1 Europe 0.089286
1 United Kingdom 0.35 University of Cambridge 2 Europe 0.091743
2 United States 0.27 California Institute of Technology 3 North America 0.153846
3 United States 0.22 Stanford University 3 North America 0.133333
4 United States 0.34 Massachusetts Institute of Technology 5 North America 0.114943
5 United States 0.26 Harvard University 6 North America 0.112360
6 United States 0.24 Princeton University 7 North America 0.120482
7 United Kingdom 0.55 Imperial College London 8 Europe 0.087719
8 United States 0.25 University of Chicago 9 North America 0.161290
9 Switzerland 0.38 ETH Zurich – Swiss Federal Institute of Techno... 10 Europe 0.068493
10 United States 0.20 University of Pennsylvania 10 North America 0.153846
11 United States 0.21 Yale University 12 North America 0.232558
12 United States 0.24 Johns Hopkins University 13 North America 0.232558
13 United States 0.32 Columbia University 14 North America 0.163934
14 United States 0.17 University of California, Los Angeles 15 North America 0.104167
15 United Kingdom 0.49 University College London 16 Europe 0.095238
16 United States 0.22 Duke University 17 North America 0.222222
17 United States 0.17 University of California, Berkeley 18 North America 0.076336
18 United States 0.24 Cornell University 19 North America 0.102041
19 United States 0.18 Northwestern University 20 North America 0.078125
20 United States 0.16 University of Michigan 21 North America 0.116279
21 Singapore 0.30 National University of Singapore 22 Asia 0.058824
22 Canada 0.17 University of Toronto 22 North America 0.053476
23 United States 0.45 Carnegie Mellon University 24 North America 0.074074
24 United Kingdom 0.71 London School of Economics and Political Science 25 Europe 0.081967
25 United States 0.16 University of Washington 25 North America 0.087719
26 United Kingdom 0.40 University of Edinburgh 27 Europe 0.080000
27 United States 0.26 New York University 27 North America 0.104167
28 China 0.16 Peking University 27 Asia 0.112360
29 China 0.09 Tsinghua University 30 Asia 0.072993
... ... ... ... ... ... ...
170 South Africa 0.18 University of Cape Town 171 Africa 0.085470
171 United States 0.10 Rutgers, the State University of New Jersey 172 North America 0.090090
172 Sweden 0.17 KTH Royal Institute of Technology 173 Europe 0.062500
173 Germany 0.07 University of Münster 173 Europe 0.023474
174 Belgium 0.35 Université Libre de Bruxelles 175 Europe 0.023041
175 United Kingdom 0.30 Newcastle University 175 Europe 0.064103
176 United Kingdom 0.35 University of Liverpool 177 Europe 0.085470
177 China 0.06 Zhejiang University 177 Asia 0.070922
178 Luxembourg 0.57 University of Luxembourg 179 Europe 0.068493
179 Netherlands 0.27 University of Twente 179 Europe 0.078125
180 France 0.16 Paris-Sud University 181 Europe 0.068966
181 France 0.13 École Normale Supérieure de Lyon 182 Europe 0.117647
182 Hong Kong 0.25 Hong Kong Polytechnic University 182 Asia 0.037175
183 Italy 0.07 Scuola Normale Superiore di Pisa 184 Europe 0.192308
184 United Kingdom 0.36 University of Aberdeen 185 Europe 0.071942
185 United States 0.16 University of Miami 186 North America 0.169492
186 United Kingdom 0.24 University of Dundee 187 Europe 0.067114
187 United Kingdom 0.30 University of East Anglia 188 Europe 0.073529
188 China 0.06 Shanghai Jiao Tong University 188 Asia 0.080645
189 Finland 0.20 Aalto University 190 Europe 0.049505
190 United States 0.14 University of Massachusetts 191 North America 0.075188
191 New Zealand 0.29 University of Auckland 192 Oceania 0.053191
192 United States 0.32 Northeastern University 193 North America 0.071429
193 Russia 0.22 Lomonosov Moscow State University 194 Europe 0.136986
194 Netherlands 0.13 Tilburg University 195 Europe 0.044444
195 France 0.15 Paris-Sorbonne University – Paris 4 196 Europe 0.033113
196 United Kingdom 0.40 Royal Holloway, University of London 197 Europe 0.064935
197 United States 0.14 University of California, Riverside 198 North America 0.048309
198 Sweden 0.13 University of Gothenburg 198 Europe 0.100000
199 Taiwan 0.08 National Taiwan University 198 Asia 0.086957

200 rows × 6 columns

Rank according to the ratio between faculty members and students


In [28]:
df2[['name', 'staff_student_ratio']].sort_values('staff_student_ratio', ascending=False)


Out[28]:
name staff_student_ratio
105 Vanderbilt University 0.303030
109 University of Copenhagen 0.243902
153 University of Rochester 0.232558
11 Yale University 0.232558
12 Johns Hopkins University 0.232558
97 Emory University 0.227273
16 Duke University 0.222222
114 École Polytechnique 0.196078
183 Scuola Normale Superiore di Pisa 0.192308
101 University of Pittsburgh 0.169492
135 University of Zurich 0.169492
185 University of Miami 0.169492
152 Technical University of Denmark 0.166667
13 Columbia University 0.163934
8 University of Chicago 0.161290
10 University of Pennsylvania 0.153846
2 California Institute of Technology 0.153846
88 Dartmouth College 0.151515
45 University of Tokyo 0.149254
157 Case Western Reserve University 0.140845
193 Lomonosov Moscow State University 0.136986
50 Washington University in St Louis 0.133333
3 Stanford University 0.133333
131 University of Science and Technology of China 0.121951
6 Princeton University 0.120482
181 École Normale Supérieure de Lyon 0.117647
69 Boston University 0.116279
20 University of Michigan 0.116279
122 Georgetown University 0.116279
73 Kyoto University 0.114943
... ... ...
134 Stockholm University 0.040161
132 Karlsruhe Institute of Technology 0.038023
31 University of Melbourne 0.037594
84 University of New South Wales 0.037453
182 Hong Kong Polytechnic University 0.037175
112 University of Göttingen 0.035842
52 University of California, Santa Barbara 0.035587
156 Ulm University 0.035211
195 Paris-Sorbonne University – Paris 4 0.033113
80 Monash University 0.029674
166 University of Würzburg 0.029499
133 University of Adelaide 0.029155
64 University of Queensland 0.028011
155 TU Dresden 0.027855
106 Ghent University 0.027855
46 KU Leuven 0.026954
128 Université Catholique de Louvain 0.026455
93 University of Tübingen 0.025974
81 University of Freiburg 0.024814
173 University of Münster 0.023474
174 Université Libre de Bruxelles 0.023041
144 University of Cologne 0.019646
40 Technical University of Munich 0.018382
61 Humboldt University of Berlin 0.017699
87 Free University of Berlin 0.017544
78 RWTH Aachen University 0.017331
163 University of Erlangen-Nuremberg 0.017094
91 Technical University of Berlin 0.015552
124 University of Mannheim 0.013423
99 University of Bonn 0.012837

200 rows × 2 columns

For clarity, we only show the universities with the 10 highest ratio.


In [29]:
df2.sort_values('staff_student_ratio', ascending=False)[['name', 'staff_student_ratio']].head(10).plot(title='Rank according to the ratio between faculty members and students', figsize=(13,6),kind='bar', x = 'name')


Out[29]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fce3b7365c0>

For this ranking, we see quite a difference with the previous ranking. Here, universities with a high ratio are not necessarily the universities with the highest ranking.

Rank according to the ratio of international students


In [30]:
df2.sort_values('international_student_ratio', ascending=False)[['name', 'international_student_ratio']]


Out[30]:
name international_student_ratio
24 London School of Economics and Political Science 0.71
178 University of Luxembourg 0.57
37 École Polytechnique Fédérale de Lausanne 0.55
7 Imperial College London 0.55
102 Maastricht University 0.50
15 University College London 0.49
143 University of St Andrews 0.48
23 Carnegie Mellon University 0.45
120 Queen Mary University of London 0.45
39 University of Hong Kong 0.42
35 King’s College London 0.41
26 University of Edinburgh 0.40
31 University of Melbourne 0.40
196 Royal Holloway, University of London 0.40
130 University of Geneva 0.40
90 University of Warwick 0.39
0 University of Oxford 0.38
54 University of Manchester 0.38
9 ETH Zurich – Swiss Federal Institute of Techno... 0.38
158 University of Leicester 0.38
148 University of Sussex 0.38
149 Lancaster University 0.38
184 University of Aberdeen 0.36
103 University of Sheffield 0.36
114 École Polytechnique 0.36
79 University of Glasgow 0.36
1 University of Cambridge 0.35
119 City University of Hong Kong 0.35
47 Australian National University 0.35
176 University of Liverpool 0.35
... ... ...
168 Nanjing University 0.10
144 University of Cologne 0.10
45 University of Tokyo 0.10
48 University of Texas at Austin 0.10
53 University of California, Davis 0.10
68 University of Maryland, College Park 0.10
101 University of Pittsburgh 0.10
106 Ghent University 0.10
115 Fudan University 0.10
159 Texas A&M University 0.09
160 University of Arizona 0.09
29 Tsinghua University 0.09
166 University of Würzburg 0.09
142 University of Florida 0.09
95 Korea Advanced Institute of Science and Techno... 0.09
73 Kyoto University 0.08
56 University of North Carolina at Chapel Hill 0.08
134 Stockholm University 0.08
199 National Taiwan University 0.08
183 Scuola Normale Superiore di Pisa 0.07
67 Utrecht University 0.07
165 Vrije Universiteit Amsterdam 0.07
173 University of Münster 0.07
167 University of Alabama at Birmingham 0.06
177 Zhejiang University 0.06
188 Shanghai Jiao Tong University 0.06
89 University of Helsinki 0.06
161 University of California, Santa Cruz 0.04
136 Pohang University of Science and Technology 0.04
131 University of Science and Technology of China 0.03

200 rows × 2 columns

For clarity, we only show the universities with the 10 highest ratio.


In [31]:
df2.sort_values('international_student_ratio', ascending=False)[['name', 'international_student_ratio']].head(10).plot(title='Rank according to the ratio of international students', figsize=(13,6),kind='bar', x = 'name')


Out[31]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fce3b84a7b8>

The results for this ranking seem to be very similar to the other ranking with the exception of additional unversities that were not present in the other ranking.

Rank according to the ratio between faculty members and students, grouped by country


In [32]:
df_staff_country2 = df2.groupby('country').mean().sort_values('staff_student_ratio', ascending=False)[['staff_student_ratio']]
df_staff_country2


Out[32]:
staff_student_ratio
country
Denmark 0.160166
Italy 0.138889
Russia 0.136986
Japan 0.132098
United States 0.106390
France 0.099352
China 0.087420
Taiwan 0.086957
South Africa 0.085470
Switzerland 0.083447
South Korea 0.080220
United Kingdom 0.074252
Sweden 0.072475
Luxembourg 0.068493
Spain 0.062798
Singapore 0.060087
Hong Kong 0.056622
Netherlands 0.056504
Finland 0.055617
Norway 0.053763
New Zealand 0.053191
Canada 0.052726
Austria 0.048077
Ireland 0.043860
Australia 0.038233
Germany 0.027957
Belgium 0.026076

In [33]:
df_staff_country2['staff_student_ratio'].plot(title='Rank according to the ratio between faculty members and students, grouped by country', figsize=(16,6),kind='bar')


Out[33]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fce3b85d470>

We note similar results as in the first ranking.

Rank according to the ratio between faculty members and students, grouped by region


In [34]:
df_staff_region2 = df2.groupby('region').mean().sort_values('staff_student_ratio', ascending=False)[['staff_student_ratio']]
df_staff_region2


Out[34]:
staff_student_ratio
region
North America 0.101655
Africa 0.085470
Asia 0.080345
Europe 0.065952
Oceania 0.039895

In [35]:
df_staff_region2['staff_student_ratio'].plot(title='Rank according to the ratio between faculty members and students, grouped by region', figsize=(13,6),kind='bar')


Out[35]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fce3b5c2860>

We note a few differences in this result when comparing to the first ranking. First, we see that Africa is now in second place. What happened is that all the regions seem to have a lower average ratio than in the previous ranking, but Africa's average reduced the least. Moreober, we note the dissapearance of Latin America.

Rank according to the ratio of international students, grouped by country


In [36]:
df_int_country2 = df2.groupby('country').mean().sort_values('international_student_ratio', ascending=False)[['international_student_ratio']]
df_int_country2


Out[36]:
international_student_ratio
country
Luxembourg 0.570000
United Kingdom 0.365484
Hong Kong 0.328000
Switzerland 0.314286
Australia 0.307500
Singapore 0.305000
New Zealand 0.290000
Ireland 0.270000
Austria 0.260000
Canada 0.225000
Russia 0.220000
France 0.208333
Belgium 0.197500
Netherlands 0.192308
Norway 0.190000
United States 0.181774
South Africa 0.180000
Denmark 0.166667
Germany 0.152000
Spain 0.140000
Sweden 0.140000
Finland 0.130000
Italy 0.105000
Japan 0.090000
South Korea 0.090000
China 0.085714
Taiwan 0.080000

In [37]:
df_int_country2['international_student_ratio'].plot(title='Rank according to the ratio of international students, grouped by country', figsize=(16,6),kind='bar')


Out[37]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fce3b6478d0>

The results here are very similar than the previous ranking with the exception of the addition of Luxemburg, which was not present in the previous ranking.

Rank according to the ratio of international students, grouped by region


In [38]:
df_int_region2 = df2.groupby('region').mean().sort_values('international_student_ratio', ascending=False)[['international_student_ratio']]
df_int_region2


Out[38]:
international_student_ratio
region
Oceania 0.305556
Europe 0.244653
North America 0.185588
Africa 0.180000
Asia 0.165238

In [39]:
df_int_region2['international_student_ratio'].plot(title='Rank according to the ratio of international students, grouped by region', figsize=(13,6),kind='bar')


Out[39]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fce3b896be0>

The results here are again very similar to the first ranking.

3. Merging of the two dataframes

For the third question, the goal is to merge both dataframes, and thus, as sometimes the university names are not the same in both websites, we need to map them to the same value for both to perform the merge correctly.

To solve this problem, we used Google search : we search for each university names in both dataframes, and take the first link output by Google (usually the university website), and then use this URL to perform the join.

As Google blocks any device that performs too many searches, we delayed each search by 5 seconds, and then store the mapping in a JSON file to avoid searching again each time the notebook is run.

With this technique, we have a very high rate of success, with only 3 universities for which both searches didn't output the same link. These universities were fixed manually so if you save the automatically generated JSON again below, you will not have the same mapping as we do in the file that is present in the repository.


In [40]:
mapping = {}

with open('mapping.json', 'r') as f:
    mapping = json.load(f)

In [41]:
def get_url(name):
    r = requests.get('https://encrypted.google.com/search?q=' + name.replace(' ', '+')) # request the Google results page
    soup = BeautifulSoup(r.text, 'html.parser')
    google_url = soup.select('.g a')[0]['href'] # we get the first Google result
    url = google_url[google_url.find("://")+3:google_url.find("&")] # we get the URL of the first result
    
    if url.endswith("/"):
        url = url[0:-1]
    
    time.sleep(5) # the wait needed to avoid getting blocked by Google
    return url

def get_identifier(name):
    if not(name in mapping): #if the name is already in the mapping, no need to run the search again
        mapping[name] = get_url(name)
        
    return mapping[name]

Now that we have the mapping, we can apply it to both datasets and then merge them on the 'url' column.


In [42]:
df['url'] = df['name'].apply(get_identifier)

df2['url'] = df2['name'].apply(get_identifier)

raw_merge = df.merge(df2[['url', 'rank', 'international_student_ratio', 'staff_student_ratio']], how='inner', on='url')
raw_merge.columns = ['Country', 
                     'International faculty', 
                     'International students', 
                     'Name', 
                     'Rank topuniversities', 
                     'Region', 
                     'Total faculty', 
                     'Total students', 
                     'Faculty / students ratio topuniversities', 
                     'International students ratio topuniversities', 
                     'url', 
                     'Rank timeshighereducation', 
                     'International students ratio timeshighereducation', 
                     'Faculty / students ratio timeshighereducation']
raw_merge


Out[42]:
Country International faculty International students Name Rank topuniversities Region Total faculty Total students Faculty / students ratio topuniversities International students ratio topuniversities url Rank timeshighereducation International students ratio timeshighereducation Faculty / students ratio timeshighereducation
0 United States 1679.0 3717.0 Massachusetts Institute of Technology (MIT) 1 North America 2982.0 11067.0 0.269450 0.335863 web.mit.edu 5 0.34 0.114943
1 United States 2042.0 3611.0 Stanford University 2 North America 4285.0 15878.0 0.269870 0.227422 www.stanford.edu 3 0.22 0.133333
2 United States 1311.0 5266.0 Harvard University 3 North America 4350.0 22429.0 0.193945 0.234785 www.harvard.edu 6 0.26 0.112360
3 United States 350.0 647.0 California Institute of Technology (Caltech) 4 North America 953.0 2255.0 0.422616 0.286918 www.caltech.edu 3 0.27 0.153846
4 United Kingdom 2278.0 6699.0 University of Cambridge 5 Europe 5490.0 18770.0 0.292488 0.356899 www.cam.ac.uk 2 0.35 0.091743
5 United Kingdom 2964.0 7353.0 University of Oxford 6 Europe 6750.0 19720.0 0.342292 0.372870 www.ox.ac.uk 1 0.38 0.089286
6 United Kingdom 2554.0 14854.0 UCL (University College London) 7 Europe 6345.0 31080.0 0.204151 0.477928 www.ucl.ac.uk 16 0.49 0.095238
7 United Kingdom 2071.0 8746.0 Imperial College London 8 Europe 3930.0 16090.0 0.244251 0.543567 www.imperial.ac.uk 8 0.55 0.087719
8 United States 635.0 3379.0 University of Chicago 9 North America 2449.0 13557.0 0.180645 0.249244 www.uchicago.edu 9 0.25 0.161290
9 Switzerland 1886.0 7563.0 ETH Zurich - Swiss Federal Institute of Techno... 10 Europe 2477.0 19815.0 0.125006 0.381681 www.ethz.ch/en.html 10 0.38 0.068493
10 Singapore 2993.0 7251.0 Nanyang Technological University, Singapore (NTU) 11 Asia 4338.0 25738.0 0.168545 0.281724 www.ntu.edu.sg 52 0.31 0.061350
11 Switzerland 1300.0 5896.0 Ecole Polytechnique Fédérale de Lausanne (EPFL) 12 Europe 1695.0 10343.0 0.163879 0.570047 www.epfl.ch 38 0.55 0.089286
12 United States 246.0 1793.0 Princeton University 13 North America 1007.0 8069.0 0.124799 0.222208 www.princeton.edu 7 0.24 0.120482
13 United States 970.0 5411.0 Cornell University 14 North America 2718.0 21904.0 0.124087 0.247033 www.cornell.edu 19 0.24 0.102041
14 Singapore 3086.0 8917.0 National University of Singapore (NUS) 15 Asia 5106.0 32728.0 0.156013 0.272458 www.nus.edu.sg 22 0.30 0.058824
15 United States 1708.0 2469.0 Yale University 16 North America 4940.0 12402.0 0.398323 0.199081 www.yale.edu 12 0.21 0.232558
16 United States 1061.0 4105.0 Johns Hopkins University 17 North America 4462.0 16146.0 0.276353 0.254243 www.jhu.edu 13 0.24 0.232558
17 United States 913.0 8105.0 Columbia University 18 North America 6189.0 25045.0 0.247115 0.323617 www.columbia.edu 14 0.32 0.163934
18 United States 1383.0 4250.0 University of Pennsylvania 19 North America 5499.0 20639.0 0.266437 0.205921 www.upenn.edu 10 0.20 0.153846
19 Australia 927.0 5551.0 The Australian National University 20 Oceania 1600.0 14442.0 0.110788 0.384365 www.anu.edu.au 48 0.35 0.051813
20 United States 1920.0 7527.0 University of Michigan 21 North America 6809.0 43147.0 0.157809 0.174450 umich.edu 21 0.16 0.116279
21 United States 226.0 2864.0 Duke University 21 North America 2938.0 15320.0 0.191775 0.186945 www.duke.edu 17 0.22 0.222222
22 United Kingdom 1553.0 10551.0 The University of Edinburgh 23 Europe 4075.0 28040.0 0.145328 0.376284 www.ed.ac.uk 27 0.40 0.080000
23 United Kingdom 1513.0 9451.0 King's College London 23 Europe 3650.0 23780.0 0.153490 0.397435 www.kcl.ac.uk 36 0.41 0.083333
24 China 932.0 4072.0 Tsinghua University 25 Asia 5506.0 36300.0 0.151680 0.112176 www.tsinghua.edu.cn/publish/newthuen 30 0.09 0.072993
25 Hong Kong 2085.0 8230.0 The University of Hong Kong 26 Asia 3012.0 20214.0 0.149006 0.407144 www.hku.hk 40 0.42 0.055556
26 United States 1395.0 6086.0 University of California, Berkeley (UCB) 27 North America 3321.0 36703.0 0.090483 0.165818 berkeley.edu 18 0.17 0.076336
27 Japan 254.0 2639.0 The University of Tokyo 28 Asia 4514.0 27279.0 0.165475 0.096741 www.u-tokyo.ac.jp/en 46 0.10 0.149254
28 United States 1141.0 3422.0 Northwestern University 28 North America 4231.0 18675.0 0.226560 0.183240 www.northwestern.edu 20 0.18 0.078125
29 Hong Kong 835.0 2921.0 The Hong Kong University of Science and Techno... 30 Asia 1150.0 10375.0 0.110843 0.281542 www.ust.hk 44 0.31 0.043290
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
128 Switzerland 1031.0 3337.0 University of Lausanne 146 Europe 2173.0 13435.0 0.161742 0.248381 www.unil.ch/central/en/home.html 152 0.25 0.073529
129 Netherlands 458.0 5229.0 Erasmus University Rotterdam 147 Europe 2011.0 23273.0 0.086409 0.224681 www.eur.nl/english 72 0.20 0.049751
130 United States 484.0 2154.0 Emory University 147 North America 2560.0 12852.0 0.199191 0.167600 www.emory.edu 98 0.19 0.227273
131 Switzerland 684.0 3441.0 University of Basel 149 Europe 1057.0 12852.0 0.082244 0.267740 www.unibas.ch/en.html 95 0.27 0.056497
132 United States 653.0 7427.0 Michigan State University 149 North America 2884.0 44951.0 0.064159 0.165224 msu.edu 83 0.17 0.058824
133 Belgium 406.0 4468.0 Université catholique de Louvain (UCL) 153 Europe 1219.0 24277.0 0.050212 0.184043 uclouvain.be 129 0.19 0.026455
134 Austria 1400.0 14468.0 University of Vienna 154 Europe 3411.0 45967.0 0.074205 0.314748 www.univie.ac.at/en 165 0.26 0.048077
135 France 91.0 221.0 École Normale Supérieure de Lyon 157 Europe 382.0 2020.0 0.189109 0.109406 www.ens-lyon.fr 182 0.13 0.117647
136 United Kingdom 508.0 4246.0 University of Aberdeen 158 Europe 1310.0 12005.0 0.109121 0.353686 www.abdn.ac.uk 185 0.36 0.071942
137 United Kingdom 597.0 5504.0 The University of Exeter 158 Europe 1756.0 20749.0 0.084631 0.265266 www.exeter.ac.uk 130 0.28 0.057143
138 United Kingdom 835.0 5717.0 Newcastle University 161 Europe 2428.0 21673.0 0.112029 0.263784 www.ncl.ac.uk 175 0.30 0.064103
139 United States 195.0 6542.0 University of Minnesota 163 North America 4919.0 59742.0 0.082337 0.109504 twin-cities.umn.edu 56 0.11 0.057471
140 United States 377.0 8538.0 University of California, Irvine 164 North America 1926.0 32151.0 0.059905 0.265559 uci.edu 99 0.21 0.058824
141 Germany 796.0 3860.0 Eberhard Karls Universität Tübingen 164 Europe 3844.0 27366.0 0.140466 0.141051 www.uni-tuebingen.de/en/university.html 94 0.13 0.025974
142 Switzerland 661.0 1618.0 University of Bern 167 Europe 1172.0 10758.0 0.108942 0.150400 www.unibe.ch/index_eng.html 105 0.15 0.060606
143 United States 27.0 945.0 Dartmouth College 169 North America 939.0 6224.0 0.150868 0.151832 home.dartmouth.edu 89 0.15 0.151515
144 United Kingdom 717.0 7502.0 University of Liverpool 173 Europe 2390.0 22065.0 0.108316 0.339995 www.liverpool.ac.uk 177 0.35 0.085470
145 United States 487.0 1899.0 University of Virginia 173 North America 2509.0 21560.0 0.116373 0.088080 www.virginia.edu 113 0.12 0.103093
146 United States 504.0 4597.0 University of Florida 178 North America 5419.0 45485.0 0.119138 0.101066 earch?q=University+of+Florida 143 0.09 0.057143
147 Netherlands 299.0 2235.0 University of Twente 179 Europe 910.0 9332.0 0.097514 0.239498 www.utwente.nl/en 179 0.27 0.078125
148 Germany 641.0 3736.0 University of Göttingen 181 Europe 3678.0 30402.0 0.120979 0.122887 www.uni-goettingen.de/en/1.html 113 0.12 0.035842
149 United States 566.0 2679.0 University of Colorado Boulder 182 North America 3685.0 30219.0 0.121943 0.088653 www.colorado.edu 100 0.10 0.061350
150 United States 488.0 2805.0 University of Rochester 186 North America 2569.0 9636.0 0.266604 0.291096 www.rochester.edu 153 0.29 0.232558
151 South Africa 379.0 3325.0 University of Cape Town 191 Africa 1733.0 19593.0 0.088450 0.169703 www.uct.ac.za 171 0.18 0.085470
152 Italy 1.0 48.0 Scuola Normale Superiore di Pisa 192 Europe 86.0 532.0 0.161654 0.090226 en.sns.it 184 0.07 0.192308
153 Italy 1.0 135.0 Scuola Superiore Sant'Anna Pisa di Studi Unive... 192 Europe 119.0 789.0 0.150824 0.171103 www.santannapisa.it 155 0.14 0.085470
154 Sweden 646.0 2636.0 Stockholm University 195 Europe 2154.0 28281.0 0.076164 0.093207 www.su.se/english 134 0.08 0.040161
155 Spain 230.0 3848.0 Universitat Autònoma de Barcelona 195 Europe 2187.0 31986.0 0.068374 0.120303 www.uab.cat/web/universitat-autonoma-de-barcel... 147 0.16 0.077519
156 United States 206.0 4900.0 Texas A&M University 195 North America 3446.0 60294.0 0.057153 0.081268 www.tamu.edu 159 0.09 0.042373
157 Netherlands 502.0 8234.0 Maastricht University 199 Europe 1277.0 16385.0 0.077937 0.502533 www.maastrichtuniversity.nl 103 0.50 0.055556

158 rows × 14 columns


In [43]:
# Uncomment and run this if you want to save the JSON
"""with open('mapping.json', 'w') as f:
    json.dump(mapping, f)"""


Out[43]:
"with open('mapping.json', 'w') as f:\n    json.dump(mapping, f)"

4. Find useful insights in the data

We begin by cleaning and adapting our dataset to better see meaningful correlations. For example, we put the ranking of each unversity in the reverse order, since we want to have a higher value for higher ranked universities to get meaningful correlations. Moreover we drop some rows where we miss the data we need. We also add another ratio that we thought to be usefull for the last part of the homework, the international faculy ratio for each unversity.


In [44]:
raw_merge_cleaned = raw_merge.copy()
raw_merge_cleaned['rank_time'] = 200 - raw_merge_cleaned['Rank timeshighereducation'].astype(int)
raw_merge_cleaned['rank_top'] = 200 - raw_merge_cleaned['Rank topuniversities'].astype(int)
raw_merge_cleaned[pd.notnull(raw_merge['International students ratio topuniversities'])]
raw_merge_cleaned['International faculty ratio'] = raw_merge_cleaned['International faculty'] / raw_merge_cleaned['Total faculty']

In [45]:
raw_merge_cleaned.corr()


Out[45]:
International faculty International students Rank topuniversities Total faculty Total students Faculty / students ratio topuniversities International students ratio topuniversities International students ratio timeshighereducation Faculty / students ratio timeshighereducation rank_time rank_top International faculty ratio
International faculty 1.000000 0.579780 -0.516205 0.645592 0.286353 0.316277 0.381998 0.327211 0.013953 0.365093 0.516205 0.600152
International students 0.579780 1.000000 -0.258194 0.414110 0.627124 -0.245137 0.528667 0.417507 -0.338509 0.254831 0.258194 0.323016
Rank topuniversities -0.516205 -0.258194 1.000000 -0.366388 -0.020634 -0.400063 -0.317166 -0.308134 -0.213109 -0.696734 -1.000000 -0.359499
Total faculty 0.645592 0.414110 -0.366388 1.000000 0.588380 0.337531 -0.083055 -0.127747 0.168920 0.364960 0.366388 -0.095990
Total students 0.286353 0.627124 -0.020634 0.588380 1.000000 -0.430598 -0.205709 -0.277173 -0.369667 0.147799 0.020634 -0.191878
Faculty / students ratio topuniversities 0.316277 -0.245137 -0.400063 0.337531 -0.430598 1.000000 0.110572 0.133461 0.696981 0.305359 0.400063 0.073430
International students ratio topuniversities 0.381998 0.528667 -0.317166 -0.083055 -0.205709 0.110572 1.000000 0.954589 -0.074607 0.223201 0.317166 0.657059
International students ratio timeshighereducation 0.327211 0.417507 -0.308134 -0.127747 -0.277173 0.133461 0.954589 1.000000 -0.033478 0.201501 0.308134 0.640265
Faculty / students ratio timeshighereducation 0.013953 -0.338509 -0.213109 0.168920 -0.369667 0.696981 -0.074607 -0.033478 1.000000 0.120484 0.213109 -0.168201
rank_time 0.365093 0.254831 -0.696734 0.364960 0.147799 0.305359 0.223201 0.201501 0.120484 1.000000 0.696734 0.131918
rank_top 0.516205 0.258194 -1.000000 0.366388 0.020634 0.400063 0.317166 0.308134 0.213109 0.696734 1.000000 0.359499
International faculty ratio 0.600152 0.323016 -0.359499 -0.095990 -0.191878 0.073430 0.657059 0.640265 -0.168201 0.131918 0.359499 1.000000

We notice that there is an obvious and natural correlation between the columns corresponding to the numbers of students and faculty members. Of course, it's normal than when there are more students, there are more international students as well and more faculty members. These correlations however are not interesting.

Another uninsteresting observation are the correlations between the ratios in the two rankings.

The more interesting observations are given by the ratios than the numbers.


In [46]:
raw_merge_cleaned[['Faculty / students ratio topuniversities',
                   'International students ratio topuniversities',
                   'International students ratio timeshighereducation',
                   'Faculty / students ratio timeshighereducation',
                   'International faculty ratio',
                   'rank_time',
                   'rank_top']].corr()


Out[46]:
Faculty / students ratio topuniversities International students ratio topuniversities International students ratio timeshighereducation Faculty / students ratio timeshighereducation International faculty ratio rank_time rank_top
Faculty / students ratio topuniversities 1.000000 0.110572 0.133461 0.696981 0.073430 0.305359 0.400063
International students ratio topuniversities 0.110572 1.000000 0.954589 -0.074607 0.657059 0.223201 0.317166
International students ratio timeshighereducation 0.133461 0.954589 1.000000 -0.033478 0.640265 0.201501 0.308134
Faculty / students ratio timeshighereducation 0.696981 -0.074607 -0.033478 1.000000 -0.168201 0.120484 0.213109
International faculty ratio 0.073430 0.657059 0.640265 -0.168201 1.000000 0.131918 0.359499
rank_time 0.305359 0.223201 0.201501 0.120484 0.131918 1.000000 0.696734
rank_top 0.400063 0.317166 0.308134 0.213109 0.359499 0.696734 1.000000

When looking at only the correlations involving the ratios and ranks, we can note a few interesting things. First note the additional ratio that is the ratio of international faculty members (this information comes from the Top Universities ranking only).

First, we can note that the faculty/students ratio from Top Universities has a relatively high correlation with both ranks. There are two possible explanations in our opinion. First we can think that higher ranked universities may have more funds to hire more faculty staff. Secondly, we can also assume that universities with a higher rank may want to keep their high position or improve it by hiring more faculty members.

Another relatively important correlation we see is the correlation between international student ratio and both ranks. This can surely be explained by the attractiveness of high ranked universities to international students.

The highest correlation we observe is the one between international faculty ratio and the international students ratio. Universities that attract international students will also attract international faculty members for probably the same reason.


In [47]:
raw_merge_cleaned[['Total students',
                   'Total faculty',
                   'International faculty',
                   'International students',
                   'rank_time',
                   'rank_top']].corr()


Out[47]:
Total students Total faculty International faculty International students rank_time rank_top
Total students 1.000000 0.588380 0.286353 0.627124 0.147799 0.020634
Total faculty 0.588380 1.000000 0.645592 0.414110 0.364960 0.366388
International faculty 0.286353 0.645592 1.000000 0.579780 0.365093 0.516205
International students 0.627124 0.414110 0.579780 1.000000 0.254831 0.258194
rank_time 0.147799 0.364960 0.365093 0.254831 1.000000 0.696734
rank_top 0.020634 0.366388 0.516205 0.258194 0.696734 1.000000

We can still take a quick look at the total numbers and their correlations. In fact we can see that there is almost no correlation between the total number of students and the rank. Moreover, we see also see that the higher the rank, the more faculty members there are in the university. This confirms our previous intuition when we looked at the ratios.

5. Find the best university

For the last question, we have the following approach. We will create new score of the universities based on the different ratio columns for which we have the correlations. Then we will calculate the weighted average of each of these scores for each university which will define the new ranking. The top scored university of this ranking will our best unversity.

The weights we use will be based on the correlations we found earlier and the explanations that we apply to them. We will apply the following weights for each collumns:

  • Faculty/student ratio: 0.1 for each ranking. We consider this to be the most important factor for the rank for reasons explained above.
  • International student ratios: 0.025 for each ranking. We choose this weight because we consider the ratio to be the mostly the consequence of the ranking: high ranked universities attrach more international students because of their rank.
  • International faculty ratio : 0.15 (we source this statistic from only one of the rankings). This is an important factor for us because we believe that a large number of international faculty members may mean that the university is already good and attracts professors from around the world.
  • Original ranks (normalized between 0 to 1, 1 being the best): 0.3 for each. Since the original ranks are made from other interesting factors that we have not explored here, it's important to take them into account at a high weight.

In [48]:
raw_merge_cleaned['score'] = (0.1 * raw_merge_cleaned['Faculty / students ratio topuniversities'] +
                              0.1 * raw_merge_cleaned['Faculty / students ratio timeshighereducation'] +
                              0.025 * raw_merge_cleaned['International students ratio topuniversities'] +
                              0.025 * raw_merge_cleaned['International students ratio timeshighereducation'] +
                              0.15 * raw_merge_cleaned['International faculty ratio'] + 
                              0.3 * (raw_merge_cleaned['rank_time'] / 200.0) +
                              0.3 * (raw_merge_cleaned['rank_top'] / 200.0))
raw_merge_cleaned[['Name', 'score']].sort_values(by='score', ascending=False).reset_index(drop=True)


Out[48]:
Name score
0 Massachusetts Institute of Technology (MIT) 0.730793
1 ETH Zurich - Swiss Federal Institute of Techno... 0.722603
2 University of Oxford 0.717346
3 California Institute of Technology (Caltech) 0.716158
4 Imperial College London 0.715582
5 Stanford University 0.715488
6 University of Cambridge 0.707836
7 Ecole Polytechnique Fédérale de Lausanne (EPFL) 0.693362
8 Yale University 0.683177
9 UCL (University College London) 0.680015
10 Harvard University 0.674707
11 National University of Singapore (NUS) 0.670953
12 London School of Economics and Political Scien... 0.659094
13 University of Chicago 0.658568
14 Johns Hopkins University 0.653915
15 Nanyang Technological University, Singapore (NTU) 0.646775
16 University of Pennsylvania 0.646401
17 The University of Hong Kong 0.645969
18 Princeton University 0.642727
19 Cornell University 0.638821
20 Columbia University 0.631323
21 The Hong Kong University of Science and Techno... 0.628115
22 The University of Edinburgh 0.624106
23 University of California, Berkeley (UCB) 0.620585
24 The Australian National University 0.619526
25 King's College London 0.617546
26 University of Michigan 0.615067
27 University of Toronto 0.612089
28 Northwestern University 0.608001
29 Duke University 0.606112
... ... ...
128 University of Lausanne 0.260156
129 University of California, Irvine 0.258623
130 Dartmouth College 0.255097
131 The University of Exeter 0.246806
132 Maastricht University 0.244379
133 Université catholique de Louvain (UCL) 0.243977
134 University of Oslo 0.234729
135 Nanjing University 0.232647
136 University of Virginia 0.227264
137 University of Colorado Boulder 0.223085
138 National Taiwan University (NTU) 0.222618
139 Cardiff University 0.221205
140 Lomonosov Moscow State University 0.220445
141 University of Vienna 0.209662
142 University of Göttingen 0.206896
143 University of Rochester 0.184437
144 University of Aberdeen 0.179616
145 Newcastle University 0.179293
146 Aalto University 0.177720
147 Stockholm University 0.167449
148 École Normale Supérieure de Lyon 0.163894
149 University of Liverpool 0.156629
150 University of Florida 0.154856
151 University of Twente 0.142587
152 Universitat Autònoma de Barcelona 0.124372
153 University of Cape Town 0.115939
154 Scuola Superiore Sant'Anna Pisa di Studi Unive... 0.112167
155 Texas A&M University 0.092201
156 Scuola Normale Superiore di Pisa 0.077146
157 New York University (NYU) NaN

158 rows × 2 columns

According to our new ranking, the best university is the Massachusetts Institute of Technology.